Description

Method for computer-supported speech recognition, speech recognition system and 
control device for controlling a technical system and telecommunications device 

The invention relates to a method for computer-supported speech recognition, a speech recognition system, a control device for controlling a technical system with a speech recognition system, and a telecommunications device.

In computer-supported speech recognition, a speech signal spoken by a user is digitized during pre-processing, mapped onto so-called feature vectors, and stored for the subsequent speech recognition.

Depending on the application, the feature vectors have a fixed, predefined number of feature vector components. These components are usually ordered in the feature vector according to their significance for the speech recognition, that is to say in order of decreasing information content (decreasing statistical variance).

However, particularly in speech recognition applications on an embedded system, the available computing power and storage space are limited. For this reason, known speech recognition applications frequently run into problems, in particular owing to the very high number of feature vector components.

[1] describes a method for calculating distances between a feature vector and a plurality of comparison vectors. In this method, the discrimination capability is determined for each component of the feature vector. For those components of the feature vector whose discrimination capability is worse than a predefined threshold value, a first partial distance from a group of components of the comparison vectors is determined. For those components whose discrimination capability is better than the predefined threshold value, second partial distances from the corresponding components of the comparison vectors are determined. The distances between the feature vector and the plurality of comparison vectors are then determined from the first partial distance and the second partial distances.

The invention is based on the problem of providing a method for computer-supported speech recognition, and a speech recognition system, that manage with reduced computing power or reduced storage space.

The problem is solved by the method for computer-supported speech recognition, by 
the speech-recognition system, by the control device and by the telecommunications device 
with the features according to the independent patent claims. 

In a method for computer-supported speech recognition using feature vectors, recognition rate information is stored, which is preferably determined at the start of the method. For the feature vectors, this recognition rate information specifies, as a function of the information content of the feature vector components, which speech recognition rate can be achieved with feature vectors containing the respectively considered feature vector components.

In a first step, the speech recognition rate required for a given speech recognition application is determined.

Using the stored recognition rate information, the computer determines the minimum information content of the feature vector components that is necessary to ensure the required speech recognition rate.

In addition, the number of feature vector components that the speech recognition system needs for the respective speech recognition application in order to provide the determined information content is determined.

In addition, a code book is preferably generated for the respective speech recognition application, taking into account the previously determined number of feature vector components.

Then, preferably using the application-specific code book, the speech recognition is carried out using feature vectors with the number of feature vector components necessary to provide the determined information content.

The speech recognition, that is to say the comparison of feature vectors, in particular the comparison of the feature vectors of the spoken speech signal with the feature vectors of reference words stored in an electronic dictionary, is thus carried out using feature vectors with the number of feature vector components necessary to ensure the previously determined speech recognition rate.
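
The selection steps above can be illustrated with a small Python sketch. It assumes, purely for illustration, that the recognition rate information is held as a list of rows, each giving a number of feature vector components, the share of the total information content those components carry, and the recognition rate measured with them. The `RateEntry` layout, the function names and the example numbers are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RateEntry:
    """One row of a (hypothetical) recognition rate information table."""
    num_components: int      # feature vector components taken into account
    info_content: float      # share of the total information content, 0..1
    recognition_rate: float  # rate measured on the test data record, 0..1


def min_info_content(table: List[RateEntry], required_rate: float) -> float:
    """Smallest information content whose measured rate meets the requirement."""
    candidates = [e.info_content for e in table if e.recognition_rate >= required_rate]
    if not candidates:
        raise ValueError("required recognition rate cannot be reached")
    return min(candidates)


def components_needed(table: List[RateEntry], required_rate: float) -> int:
    """Fewest feature vector components that provide that information content."""
    target = min_info_content(table, required_rate)
    return min(e.num_components for e in table if e.info_content >= target)
```

With the made-up table `[RateEntry(78, 1.0, 0.95), RateEntry(40, 0.97, 0.94), RateEntry(24, 0.90, 0.92), RateEntry(12, 0.70, 0.85)]`, a required rate of 0.92 yields 24 components, while a required rate of 0.80 yields 12.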

A speech recognition system has a speech recognition unit and an electronic dictionary, which is connected to the speech recognition unit and stores the words taken into account in the speech recognition.

In addition, the speech recognition system has a recognition rate information memory that stores recognition rate information specifying for the feature vectors, as a function of the information content of the feature vector components, which speech recognition rate can be achieved with feature vectors containing the respectively considered feature vector components. Before the actual speech recognition is carried out, the recognition rate information is determined, preferably with reference to a training data record, by a recognition rate information determining unit, which is also provided. In addition, the speech recognition system has an information content determining unit for determining the information content of the feature vector components of a feature vector, and a feature-vector-component selection unit for selecting the feature vector components that are to be taken into account in the speech recognition.

A control device for controlling a technical system has the speech recognition system described above, the control instructions provided for controlling the technical system being stored in the electronic dictionary for the purpose of speech recognition, preferably speaker-independent speech recognition.

The invention thus makes it possible for the first time to take the actual application-specific requirements on the recognition rate into account flexibly when selecting the feature vector components of feature vectors for speech recognition, without a speech recognition rate having to be newly determined for each speech recognition application.

In this way, an optimized compromise is achieved, in particular with respect to the storage space requirement, by an application-dependent reduction of the dimension of the feature vectors, in other words of the number of feature vector components taken into account. Reducing the number of feature vector components taken into account in the speech recognition also considerably reduces the computing power required for the speech recognition itself.

For this reason, the invention is particularly suitable for use in an embedded system.

In addition, a considerable saving in computing time is achieved: for a new speech recognition application, only the number of necessary feature vector components has to be determined from the recognition rate information, which has been acquired once beforehand, and the code book can then be generated directly using feature vectors with the determined number of feature vector components.

Preferred developments of the invention emerge from the dependent claims. 

The refinements of the invention described below relate to the method, the speech recognition system and the control device.

For the speech recognition itself, a method for speaker-independent speech recognition is preferably used, particularly preferably one based on Hidden Markov Models.

Alternatively, statistical classifiers, for example artificial neural networks, can be used for speech recognition, in particular for speaker-independent speech recognition.

However, according to the invention, any desired speech recognition method can generally be used.

According to another refinement of the invention, the feature vector components with relatively high information content are selected from among the feature vector components of the respective feature vector and used in the speech recognition.

This refinement ensures that it is the feature vector components with the smallest information content among all the feature vector components that are not taken into account. The information lost in the speech recognition because feature vector components are omitted is thus minimized.
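
A minimal Python sketch of this selection, assuming the information content of each component is already known (for example as the variance of the component after the LDA transformation); the function name is hypothetical.

```python
from typing import List, Sequence


def select_components(feature_vector: Sequence[float],
                      info_content: Sequence[float], k: int) -> List[float]:
    """Keep the k components with the largest information content.

    Returns the kept components ordered by decreasing information content.
    """
    if len(feature_vector) != len(info_content):
        raise ValueError("one information-content value per component is required")
    # indices sorted so that the most informative component comes first
    order = sorted(range(len(info_content)), key=lambda i: info_content[i], reverse=True)
    return [feature_vector[i] for i in order[:k]]
```

For example, `select_components([1.0, 2.0, 3.0, 4.0], [0.1, 0.5, 0.3, 0.9], 2)` keeps the fourth and second components, `[4.0, 2.0]`.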
A suitable control device for controlling a technical system is, for example, a control device for controlling a telecommunications device, such as a telephone device, a telefax device, a PDA or a notebook, or for controlling a terminal in which at least two of the device functionalities described above are integrated in a common device. In particular, such devices, which are controlled with a clearly defined, limited vocabulary, can be controlled by a speech dialogue that can be implemented in a relatively clearly organized way, and thus cost-effectively, even by means of an embedded system.

The application-adapted, considerable reduction in the dimension of the processed feature vectors leads to a considerable time saving in the development of a speech recognition system; in particular, the code book used is reduced considerably, which also considerably reduces the storage space requirement.

An exemplary embodiment of the invention is illustrated in the figures and explained in more detail below. In the drawings:



Figure 1 shows a block diagram of a speech recognition system according to an exemplary embodiment of the invention;

Figure 2 shows an outline of the memory of the computer from figure 1 in detail;

Figure 3 shows a block diagram illustrating the individual method steps for determining recognition rate information according to an exemplary embodiment of the invention;

Figure 4 shows a flowchart illustrating the individual method steps for determining recognition rate information according to an exemplary embodiment of the invention;

Figure 5 shows an outline of recognition rate information according to an exemplary embodiment of the invention; and

Figure 6 shows a flowchart illustrating the individual method steps of the method for speech recognition according to an exemplary embodiment of the invention.
Figure 1 shows a speech recognition system 100 according to an exemplary embodiment of the invention.

The speech recognition system 100 operates in one of two operating modes. In the first operating mode, it operates as a speech recognition device that recognizes an utterance 101 spoken in the speech recognition mode by a user (not illustrated) of the speech recognition system 100. The speech recognition is carried out using a method for speaker-independent speech recognition.

In a second operating mode, also referred to below as the training mode, the speech recognition system 100 is trained using an utterance 101, as explained in more detail below. According to this exemplary embodiment, this means that individual Hidden Markov Models for an utterance are trained by means of the spoken utterance 101.

In both operating modes, the speech signal 101 uttered by the user is fed to a microphone 102. The recorded electrical analog signal 103 is pre-amplified by a pre-amplification unit 104 and fed as an amplified analog signal 105 to an analog/digital converter 106, which converts it into a digital signal 107. The digital signal 107 is fed to a feature extraction unit 108, which subjects it to a spectral transformation and forms, for an utterance, a sequence of feature vectors 109 representing the digital signal 107.

Each feature vector 109 has a predefined number of feature vector components. According to this exemplary embodiment, the feature vectors each have 78 feature vector components.

The feature vectors 109 are fed to a computer 110.

It should be noted in this context that the microphone 102, the pre-amplification unit 104, the analog/digital converter 106 and the feature extraction unit 108 can be implemented as separate units or as units integrated in the computer 110.

According to this exemplary embodiment of the invention, the feature vectors 109 are fed to the computer 110 via its input interface 111.
The computer 110 also has a microprocessor 112, a memory 113 and an output interface 114, which are all connected to one another by a computer bus 115.

The method steps described below, in particular the method for determining the recognition rate information explained below and the speech recognition methods, are carried out by the microprocessor 112.

An electronic dictionary (explained in more detail below) stored in the memory 113 contains, in the form of trained Hidden Markov Models, the entries that serve as reference words in the speech recognition process; only these reference words can be recognized by the speech recognition algorithm at all.

Alternatively, a digital signal processor that implements the speech recognition algorithms used may be provided; it may have a microcontroller specialized for this purpose.

In addition, the computer 110 is connected via the input interface 111 to a keyboard 116 and a computer mouse 117 by electrical leads 118, 119 or by a radio link, for example an infrared link or a Bluetooth link.

The computer 110 is connected via the output interface 114 to a loudspeaker 122 and an actuator 123 by additional cables or radio links 120, 121, for example an infrared link or a Bluetooth link.

The actuator 123 in figure 1 generally represents any possible actuator for controlling a technical system, implemented for example as a hardware switch or as a computer program, in case, for example, a telecommunications device or another technical system, such as a car radio, a stereo system, a video recorder, a television, the computer 110 itself or some other technical system, is to be controlled.

According to the exemplary embodiment of the invention, the feature extraction unit 108 has a filter bank with a plurality of bandpass filters that measure the energy of the input speech signal 103 in the individual frequency bands. So-called short-term spectra are formed by means of the filter bank by rectifying the output signals of the bandpass filters, smoothing them and sampling them at short intervals, according to the exemplary embodiment every 10 ms, or alternatively every 15 ms.
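
The rectify, smooth and sample chain of the filter bank can be sketched in Python as follows. The bandpass filtering itself is assumed to have been done already, the moving-average smoother is an assumed choice (the text does not say how the signals are smoothed), and the function and parameter names are hypothetical.

```python
from typing import List, Optional, Sequence


def short_term_spectra(band_signals: Sequence[Sequence[float]], sample_rate_hz: int,
                       frame_ms: float = 10.0,
                       smooth_len: Optional[int] = None) -> List[List[float]]:
    """Rectify, smooth and sample the output of each bandpass filter.

    band_signals: one list of samples per bandpass filter (filtering not shown).
    Returns one short-term spectrum per frame, i.e. one value per band.
    """
    hop = int(sample_rate_hz * frame_ms / 1000)  # samples between two frames
    if hop < 1:
        raise ValueError("frame interval shorter than one sample")
    smooth_len = smooth_len or hop               # moving-average window length
    envelopes = []
    for band in band_signals:
        rectified = [abs(s) for s in band]      # full-wave rectification
        smoothed, acc = [], 0.0
        for i, v in enumerate(rectified):       # causal moving average
            acc += v
            if i >= smooth_len:
                acc -= rectified[i - smooth_len]
            smoothed.append(acc / min(i + 1, smooth_len))
        envelopes.append(smoothed[hop - 1::hop])  # one sample per frame
    return [list(frame) for frame in zip(*envelopes)]  # frames x bands
```

At a sampling rate of 1000 Hz and 10 ms frames, 30 samples per band give three short-term spectra; switching to the alternative 15 ms interval only changes `frame_ms`.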
The feature extraction unit 108 forms 13 cepstrum coefficients per time window. The cepstrum coefficients of two successive time windows, each with a size of 10 ms or 15 ms, are stored as feature vector components in the feature vector 109. In addition, the first and second time derivatives of the cepstrum coefficients are included as feature vector components, and the resulting feature vector 109 is fed to the computer 110 as a super feature vector.
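
The assembly of a 78-dimensional super feature vector (13 cepstrum coefficients plus their first and second time derivatives, for two successive time windows) can be sketched as follows. Estimating the derivatives with a central difference and pairing overlapping windows (t, t+1) are assumptions made for illustration only; the function names are hypothetical.

```python
from typing import List, Sequence

Frame = List[float]


def delta(frames: Sequence[Frame], t: int) -> Frame:
    """Central-difference estimate of the time derivative at frame t (assumed)."""
    prev = frames[max(t - 1, 0)]
    nxt = frames[min(t + 1, len(frames) - 1)]
    return [(b - a) / 2.0 for a, b in zip(prev, nxt)]


def super_feature_vectors(cepstra: Sequence[Frame]) -> List[Frame]:
    """Stack cepstra, first and second derivatives for two successive windows.

    cepstra: one list of 13 cepstrum coefficients per time window.
    Returns one 2 * 3 * 13 = 78-dimensional vector per pair of windows (t, t+1).
    """
    d1 = [delta(cepstra, t) for t in range(len(cepstra))]
    d2 = [delta(d1, t) for t in range(len(d1))]
    per_window = [list(c) + a + b for c, a, b in zip(cepstra, d1, d2)]  # 3 * 13
    return [per_window[t] + per_window[t + 1] for t in range(len(per_window) - 1)]
```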

In the computer 110, a speech recognition unit based on the principle of Hidden Markov Models is implemented in the form of a computer program and stored in a first memory area 201 (see figure 2) of the memory 113. The computer program is thus used to perform speaker-independent speech recognition.

At the start of the method, two different data records are formed from utterances spoken by one or more users.

A training data record, stored in a second memory area 202 of the memory 113, contains, in the form of feature vectors for the respective utterances, the utterances used for the training (described in more detail below) of the Hidden Markov Models used for speech recognition.

A test data record, that is to say the utterances used for testing the trained speech recognition unit, in other words for testing the trained Hidden Markov Models stored in a fourth memory area 204, is stored in a third memory area 203.

Recognition rate information, stored in a fifth memory area 205, is determined by means of the test data record, as explained in more detail below.

A sixth memory area 206 stores a table, explained in more detail below, which indicates for one or more applications of the speech recognition system which recognition rate is required for the respective application.

It should be pointed out in this context that the individual elements can be stored in different memory areas of the same memory 113, but also in different memories, which are preferably adapted to the respective requirements of the stored elements.
Figures 3 and 4 show, in a block diagram 300 (cf. figure 3) and in a flowchart (cf. figure 4) respectively, the individual method steps carried out by the computer 110 in order to determine the recognition rate information stored in the fifth memory area 205.

After the method starts (step 401), the individual Hidden Markov Models are trained in a training step using the training data record stored in the second memory area 202.

According to this exemplary embodiment, the Hidden Markov Models are trained in three phases:

• a first phase (step 402) in which the speech signals 301 contained in the training database are segmented by means of a segmentation unit 302,

• a second phase (step 403) in which the LDA matrix (linear discriminant analysis matrix) is calculated, and

• a third phase (step 405) in which the code book, that is to say the HMM prototype feature vectors, is calculated, in each case for a number of feature vector components selected in a selection step (step 404).

These three phases are referred to in their entirety below as the training of the Hidden Markov Models (HMM training).

The HMM training is carried out using the DSP 123 and predefined training scripts, in other words suitably designed computer programs.

According to this exemplary embodiment, each phonetic unit formed, that is to say each phoneme, is divided into three successive phoneme segments, corresponding to an initial phase (first phoneme segment), a central phase (second phoneme segment) and an end phase (third phoneme segment) of a sound, that is to say of a phoneme.

In other words, each sound is modeled by a sound model with three states, that is to say with a three-state HMM.

During the speech recognition, the three phoneme segments are arranged one after the other in a Bakis topology, or generally a left-right topology, and in the speaker-independent speech recognition the calculation is carried out on the concatenation of these three segments.
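
As an illustration of the left-right (Bakis) topology, the following sketch builds the transition matrix of a three-state phoneme model in which each state may repeat, advance by one, or skip one state. The probability values are placeholders (in practice they result from training), and folding transitions that would jump past the last state into that state is an assumption of this sketch; the function name is hypothetical.

```python
from typing import List


def bakis_transitions(num_states: int = 3, p_stay: float = 0.6,
                      p_next: float = 0.3, p_skip: float = 0.1) -> List[List[float]]:
    """Left-right (Bakis) transition matrix: stay, advance by one, or skip one.

    Probability mass that would jump past the last state is added to the last
    state, so every row sums to 1 and no transition ever points backwards.
    """
    trans = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        for step, p in ((0, p_stay), (1, p_next), (2, p_skip)):
            trans[i][min(i + step, num_states - 1)] += p
    return trans
```

All entries below the diagonal stay zero, which is what makes the topology left-right: a phoneme can only move from its initial to its central to its end segment.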
As explained in more detail below, a Viterbi algorithm is carried out in the speech recognition mode in order to decode the feature vectors formed from the input speech signal 101.
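
A textbook log-domain Viterbi decoder is sketched below to show what decoding the feature vectors involves. It is not the implementation of the exemplary embodiment: it assumes that the per-frame emission log-probabilities have already been computed (for example from the code book prototypes), and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = -math.inf  # log(0) for forbidden transitions


def viterbi(log_init: Sequence[float], log_trans: Sequence[Sequence[float]],
            log_emit: Sequence[Sequence[float]]) -> Tuple[float, List[int]]:
    """Most likely HMM state sequence in the log domain.

    log_init[i]     : log P(state i at the first frame)
    log_trans[i][j] : log P(transition i -> j)
    log_emit[t][i]  : log P(feature vector t | state i)
    Returns (best log score, best state path).
    """
    n = len(log_init)
    score = [log_init[i] + log_emit[0][i] for i in range(n)]
    backpointers = []
    for t in range(1, len(log_emit)):
        best_prev, new_score = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            best_prev.append(i_best)
            new_score.append(score[i_best] + log_trans[i_best][j] + log_emit[t][j])
        backpointers.append(best_prev)
        score = new_score
    state = max(range(n), key=lambda i: score[i])
    best = score[state]
    path = [state]
    for best_prev in reversed(backpointers):  # trace back from the last frame
        state = best_prev[state]
        path.append(state)
    return best, path[::-1]
```

Working with log-probabilities turns the products along a path into sums and avoids numerical underflow for long utterances.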

After segmentation has taken place, the LDA matrix 304 (step 403) is determined by 
means of an LDA matrix calculation unit 303. 

The LDA matrix 304 is used to transform a respective super feature vector y into a feature vector x according to the following rule:

$$x = A^{T} \cdot \left( y - \bar{y} \right) \qquad (1)$$

where

• x is a feature vector,

• A is the LDA matrix,

• y is a super feature vector,

• $\bar{y}$ is a global displacement vector.

The LDA matrix A is determined in such a way that

• the components of the feature vector x are essentially uncorrelated with one another on the statistical average,

• the statistical variances within a segment class are normalized on the statistical average,

• the centers of the segment classes are, on the statistical average, at a maximum distance from one another, and

• the dimension of the feature vectors x is reduced as far as possible, preferably as a function of the speech recognition application.

The method for determining the LDA matrix A is explained below with reference to this exemplary embodiment.

However, it should be noted that, alternatively, any known method for determining an LDA matrix A can be used without restriction.

It is assumed that there are J segment classes, each segment class j containing a set of $D_y$-dimensional super feature vectors y, that is to say the following applies:
$$\text{Class}_j = \left\{ y_j^{1}, y_j^{2}, \ldots, y_j^{N_j} \right\} \qquad (2)$$

$N_j$ being the number of super feature vectors $y_j^{n}$ in class j.

$$N = \sum_{j=1}^{J} N_j \qquad (3)$$

designates the total number of super feature vectors y.

It should be noted that the super feature vectors $y_j^{n}$ have been determined using the above-described segmentation of the speech signal database.
According to this exemplary embodiment, each super feature vector $y_j^{n}$ has a dimension $D_y$ of

$$D_y = 78 \; (= 2 \cdot 3 \cdot 13),$$

the super feature vector $y_j^{n}$ containing 13 MFCC coefficients (cepstrum coefficients) as well as their respective first and second time derivatives (hence the factor 3 above).

In addition, each super feature vector $y_j^{n}$ contains the components of two time windows that follow one another in direct temporal succession in the short-time analysis (hence the factor 2 above).

It should be noted in this context that, in principle, any number of vector components adapted to the respective application can be contained in the super feature vector $y_j^{n}$, for example up to 20 cepstrum coefficients and their associated first and second time derivatives.

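The statistical average, in other words the center of class j, is obtained according to the following rule:

$$\bar{y}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} y_j^{n} \qquad (4)$$

The covariance matrix $\Sigma_j$ of class j is obtained according to the following rule:

$$\Sigma_j = \frac{1}{N_j} \sum_{n=1}^{N_j} \left( y_j^{n} - \bar{y}_j \right) \left( y_j^{n} - \bar{y}_j \right)^{T} \qquad (5)$$

The average intra scatter matrix $S_w$ is defined as:

$$S_w = \sum_{j=1}^{J} p(j) \cdot \Sigma_j \qquad (6)$$

where

$$p(j) = \frac{N_j}{N} \qquad (7)$$

p(j) being referred to as the weighting factor of class j.

Analogously, the average inter scatter matrix $S_b$ is defined as:

$$S_b = \sum_{j=1}^{J} p(j) \cdot \left( \bar{y}_j - \bar{y} \right) \left( \bar{y}_j - \bar{y} \right)^{T} \qquad (8)$$

where

$$\bar{y} = \sum_{j=1}^{J} p(j) \cdot \bar{y}_j \qquad (9)$$

is the average super feature vector of all the classes.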
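
Equations (4) to (9) translate directly into code. The following pure-Python sketch, with hypothetical function names and no numerical library, computes the class centers, the average intra scatter matrix S_w, the average inter scatter matrix S_b and the average super feature vector from a list of segment classes, taking the weighting factor as p(j) = N_j / N (an assumption of this sketch).

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mean(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise average of a set of vectors (class center, eq. (4))."""
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(len(vectors[0]))]


def add_outer(acc: Matrix, u: Sequence[float], weight: float) -> None:
    """acc += weight * u u^T, in place."""
    for a in range(len(u)):
        for b in range(len(u)):
            acc[a][b] += weight * u[a] * u[b]


def scatter_matrices(classes: Sequence[Sequence[Sequence[float]]]
                     ) -> Tuple[Matrix, Matrix, List[float]]:
    """Return (S_w, S_b, y_bar) for a list of J segment classes."""
    N = sum(len(c) for c in classes)
    D = len(classes[0][0])
    S_w = [[0.0] * D for _ in range(D)]
    S_b = [[0.0] * D for _ in range(D)]
    weights, centers = [], []
    for c in classes:
        p = len(c) / N                      # weighting factor p(j) = N_j / N (assumed)
        m = mean(c)                         # class center
        for y in c:                         # p(j) * Sigma_j, eqs. (5) and (6)
            add_outer(S_w, [y[d] - m[d] for d in range(D)], p / len(c))
        weights.append(p)
        centers.append(m)
    y_bar = [sum(p * m[d] for p, m in zip(weights, centers)) for d in range(D)]  # eq. (9)
    for p, m in zip(weights, centers):      # eq. (8)
        add_outer(S_b, [m[d] - y_bar[d] for d in range(D)], p)
    return S_w, S_b, y_bar
```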

The LDA matrix A is decomposed according to the following rule:

$$A = U \cdot W \cdot V \qquad (10)$$

• U designating a first transformation matrix,

• W designating a second transformation matrix and

• V designating a third transformation matrix.

The first transformation matrix U is used to diagonalize the average intra scatter matrix $S_w$ and is determined by transforming the positive definite and symmetric average intra scatter matrix $S_w$ into its eigenvector space. In its eigenvector space, the average intra scatter matrix $S_w$ is a diagonal matrix whose components are greater than or equal to zero. The components whose values are greater than zero correspond to the average variance in the respective dimension defined by the corresponding vector component.

The second transformation matrix W is used to normalize the average variances and is determined according to the following rule:

$$W = \left( U^{T} \cdot S_w \cdot U \right)^{-1/2} \qquad (11)$$

For the transformation

$$B = U \cdot W \qquad (12)$$

this yields the unit matrix for the matrix $B^{T} \cdot S_w \cdot B$, and this matrix remains unchanged under any orthonormal linear transformation. The transformation $U \cdot W$ is also referred to as a whitening transformation.

In order to diagonalize the average inter scatter matrix $S_b$, the third transformation matrix V is formed by transforming the matrix

$$B^{T} \cdot S_b \cdot B, \qquad (13)$$

which also represents a positive definite and symmetric matrix, into its eigenvector space.
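
The decomposition A = U·W·V can be sketched in pure Python. The sketch uses a cyclic Jacobi eigenvalue routine (a standard method for symmetric matrices, chosen here only to avoid external libraries), assumes that S_w is positive definite, and orders the eigenvectors by decreasing σ_d². With the convention x = A^T·(y − ȳ), each component of x belongs to one column of A, that is to say one row of A^T. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X: Matrix) -> Matrix:
    return [list(col) for col in zip(*X)]


def jacobi_eigh(S: Matrix, sweeps: int = 50, tol: float = 1e-20) -> Tuple[List[float], Matrix]:
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix, cyclic Jacobi."""
    n = len(S)
    a = [row[:] for row in S]
    q = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for r in range(p + 1, n):
                if a[p][r] == 0.0:
                    continue
                # rotation angle chosen so that the (p, r) element becomes zero
                theta = (a[r][r] - a[p][p]) / (2.0 * a[p][r])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # columns p and r of a and q: X <- X P
                    a[k][p], a[k][r] = c * a[k][p] - s * a[k][r], s * a[k][p] + c * a[k][r]
                    q[k][p], q[k][r] = c * q[k][p] - s * q[k][r], s * q[k][p] + c * q[k][r]
                for k in range(n):  # rows p and r of a: a <- P^T a
                    a[p][k], a[r][k] = c * a[p][k] - s * a[r][k], s * a[p][k] + c * a[r][k]
    return [a[i][i] for i in range(n)], q


def lda_matrix(S_w: Matrix, S_b: Matrix) -> Tuple[Matrix, List[float]]:
    """A = U * W * V with columns ordered by decreasing sigma_d^2."""
    lam, U = jacobi_eigh(S_w)                          # U^T S_w U = diag(lam)
    n = len(lam)
    W = [[1.0 / math.sqrt(lam[i]) if i == j else 0.0 for j in range(n)]
         for i in range(n)]                            # (U^T S_w U)^(-1/2)
    B = matmul(U, W)                                   # whitening transformation
    sigma2, V = jacobi_eigh(matmul(matmul(transpose(B), S_b), B))
    order = sorted(range(n), key=lambda d: sigma2[d], reverse=True)
    V = [[row[d] for d in order] for row in V]         # reorder eigenvector columns
    return matmul(B, V), [sigma2[d] for d in order]
```

The result can be checked against the intended properties of A: A^T·S_w·A should be the unit matrix and A^T·S_b·A should be diagonal with the entries σ_d².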

In the transformation space

$$x = A^{T} \cdot \left( y - \bar{y} \right) \qquad (14)$$

the following matrices are thus obtained. A diagonalized average intra scatter matrix $S_w$:

$$S_w = \operatorname{diag}(1)_{d=1 \ldots D_y} \qquad (15)$$

and a diagonalized average inter scatter matrix $S_b$:

$$S_b = \operatorname{diag}\left( \sigma_d^{2} \right)_{d=1 \ldots D_y} \qquad (16)$$

$\operatorname{diag}(c_d)_{d=1 \ldots D_y}$ designating a $D_y \times D_y$ diagonal matrix with the component $c_d$ in row/column d and all other components equal to zero.

The values $\sigma_d^{2}$ are the eigenvalues of the average inter scatter matrix $S_b$ in the transformation space and represent a measure of the so-called pseudo-entropy of the feature vector components, which is also referred to as the information content of the feature vector components. It should be noted that the trace of a matrix is invariant under any orthogonal transformation, as a result of which the sum

$$\sigma^{2} = \sum_{d=1}^{D_y} \sigma_d^{2} \qquad (17)$$

represents the total average variance of the average vectors $\bar{x}_j$ of the J classes.

This results in a definite dependence of the pseudo-entropy of the feature vectors on the feature vector components that are contained in, or taken into account in, the feature vector.
According to this exemplary embodiment, a dimension reduction is then performed by sorting the values $\sigma_d^{2}$ in order of decreasing size and omitting, that is to say not taking into account, the $\sigma_d^{2}$ values that are smaller than a predefined threshold value.

The predefined threshold value can also be defined cumulatively.

The LDA matrix A can then be adapted by sorting its rows in accordance with the eigenvalues $\sigma_d^{2}$ and omitting the rows associated with the sufficiently "small" variances, which thus have only a small information content (small pseudo-entropy).

According to this exemplary embodiment, the components with the 24 largest eigenvalues $\sigma_d^{2}$ are used, in other words

$$D_x = 24.$$
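
The dimension reduction can be sketched as a function that orders the components by decreasing σ_d² and then applies an absolute threshold, a cumulative threshold relative to the total σ² = Σ σ_d², or a fixed maximum such as D_x = 24. The parameter names and the exact form of the cumulative rule are assumptions of this sketch.

```python
from typing import List, Optional, Sequence


def select_dimensions(sigma2: Sequence[float], threshold: Optional[float] = None,
                      cumulative: Optional[float] = None,
                      max_dims: Optional[int] = None) -> List[int]:
    """Indices of the LDA components to keep, ordered by decreasing sigma_d^2.

    threshold : drop components whose sigma_d^2 is below this absolute value
    cumulative: keep the shortest prefix whose share of sigma^2 = sum(sigma_d^2)
                reaches this fraction (e.g. 0.95)
    max_dims  : keep at most this many components (e.g. D_x = 24)
    """
    order = sorted(range(len(sigma2)), key=lambda d: sigma2[d], reverse=True)
    if threshold is not None:
        order = [d for d in order if sigma2[d] >= threshold]
    if cumulative is not None:
        total, acc, kept = sum(sigma2), 0.0, []
        for d in order:
            kept.append(d)
            acc += sigma2[d]
            if acc >= cumulative * total:
                break
        order = kept
    if max_dims is not None:
        order = order[:max_dims]
    return order
```

For σ_d² values `[0.5, 4.0, 1.0, 2.5]`, a cumulative share of 0.8 keeps components 1 and 3, which together carry 6.5 of the total 8.0.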

The four partial steps described above for determining the LDA matrix A 304 (step 
403) are summarized in the following table: 

Step | Aim | Method
---- | --- | ------
1 | Decorrelation of the feature vector components | Diagonalization of the average intra class covariance matrix $S_w$
2 | Normalization of the statistical variances within a class | Determination of the inverse square root of the transformed average intra class covariance matrix $U^{T} \cdot S_w \cdot U$
3 | Maximization of the distances between the class centers | Diagonalization of the transformed average inter class covariance matrix $B^{T} \cdot S_b \cdot B$
4 | Reduction of the dimension of the feature vectors | Selection of the rows of the matrix A with the 24 largest eigenvalues of $A^{T} \cdot S_b \cdot A$



The last part method within the framework of the training of the Hidden Markov Models is the clustering of the feature vectors (step 405), which is carried out by a cluster unit 305 and results in a respective code book 306, in each case specifically for a training data record with a predefined number of feature vector components.

The representatives of the segment classes are referred to in their entirety as the code book, and the representatives themselves are also referred to as prototypes of the phoneme segment classes.

The prototypes, also referred to below as prototype feature vectors, are determined according to the Baum-Welch training described in [1].

In this way, the basic entries of the electronic dictionary, that is to say the basic entries for speaker-independent speech recognition, are generated and stored, and the corresponding Hidden Markov Models are trained.

There is thus a Hidden Markov Model for each basic entry, by means of which the code book 306 is formed for the training data record with the selected number of feature vector components in its feature vectors.

After the Hidden Markov Models have been trained, the trained Hidden Markov Models are present in the fourth memory area 204.

In a subsequent method step (step 406), the recognition rate for the feature vectors of the current dimension, that is to say for the feature vectors with the current number of feature vector components, is determined for the test data record stored in the third memory area 203.

According to this exemplary embodiment, this is done by carrying out speech recognition for all utterances, that is to say for all the sequences of feature vectors in the test data record, by means of the trained Hidden Markov Models, in other words by means of a speech recognition unit 307, and comparing the speech recognition results with the setpoint results of the test data record.

The determined recognition rate 308 is obtained as the ratio of the number of correct recognition results, in other words the number of matches between the speech recognition result and the setpoint result specified in the test data record, to the total number of test utterances submitted to the speech recognition.
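
As a minimal sketch of this ratio (hypothetical function name; recognition results and setpoint results are assumed to be comparable labels, one per test utterance):

```python
from typing import Sequence


def recognition_rate(recognized: Sequence[str], setpoint: Sequence[str]) -> float:
    """Share of test utterances whose recognition result equals the setpoint result."""
    if not setpoint or len(recognized) != len(setpoint):
        raise ValueError("exactly one recognition result per test utterance is required")
    correct = sum(1 for r, s in zip(recognized, setpoint) if r == s)
    return correct / len(setpoint)
```

With three matches out of four test utterances, for example, the recognition rate is 0.75.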
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In a following step (step 304), the determined recognition rate is stored together with 
the specification of how many feature vector components are being used to determine the 
recognition rate 308 for the feature vectors of the test data records 203. 

5 Then, in a test step 407, it is checked whether the method is to be terminated. 

If this is the case, the method is terminated (step 408). 

If the method is not yet to be terminated, the number of feature vector components of 
10 the feature vectors 109 which are used within the framework of the determination of the 

recognition rate from the test data record are reduced by a predefined value, preferably by the 
value "1", that is to say by one feature vector component (step 409). 

The clustering step (step 405), and thus the creation of the respective code book 306, and the determination of the speech recognition rate (step 406) are then carried out anew, but now for feature vectors of the test data record that have each been reduced by one feature vector component.

In other words, if a feature vector according to this exemplary embodiment of the invention usually has 78 feature vector components, the recognition rate is determined with 77 feature vector components in the second iteration, with 76 feature vector components in the third iteration, and so on.
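The iteration over steps 405 to 409 can be outlined as a loop that records one recognition rate per number of feature vector components. In this sketch, `evaluate` is a hypothetical callback standing in for clustering (step 405) and recognition of the test data record (step 406), and the termination test of step 407 is assumed to be a lower bound `min_dimension`; the document does not specify the actual termination criterion.

```python
from typing import Callable, Dict


def recognition_rate_mapping(full_dimension: int,
                             min_dimension: int,
                             evaluate: Callable[[int], float]) -> Dict[int, float]:
    """Map each tested number of feature vector components to its recognition rate.

    evaluate(d) is a hypothetical stand-in for steps 405 and 406: build the code
    book with the first d components, then recognize the test data record.
    """
    mapping: Dict[int, float] = {}
    dimension = full_dimension
    while dimension >= min_dimension:          # assumed form of test step 407
        mapping[dimension] = evaluate(dimension)  # steps 405/406, then store the pair
        dimension -= 1                         # step 409: drop one component
    return mapping


# Toy evaluation function, purely for illustration.
rates = recognition_rate_mapping(78, 75, lambda d: d / 100)
print(list(rates))
```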

According to one alternative refinement of the invention, the process is not started with all the feature vector components of the super feature vector (i.e. not with all 78 feature vector components), but with a number of feature vector components that is already reduced by an application-dependent value at the start.

In addition, in each iteration, the number of feature vector components can be reduced by more than the value "1".
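Both variants (an application-dependent reduction before the first iteration, and a reduction by more than one component per iteration) amount to a different schedule of dimensions to test. A minimal sketch, with the hypothetical parameters `start_reduction`, `step` and `min_dimension`:

```python
from typing import List


def dimension_schedule(full_dimension: int, start_reduction: int = 0,
                       step: int = 1, min_dimension: int = 1) -> List[int]:
    """Numbers of feature vector components tested in successive iterations.

    start_reduction: application-dependent reduction applied before the first iteration.
    step: reduction per iteration (the basic method uses 1).
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    first = full_dimension - start_reduction
    return list(range(first, min_dimension - 1, -step))


print(dimension_schedule(78)[:3])                                # [78, 77, 76]
print(dimension_schedule(78, start_reduction=10, step=4)[:3])    # [68, 64, 60]
```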

As a result of the method described above, a pseudo-entropy mapping is thus available on the one hand, and a recognition rate mapping on the other.

The pseudo-entropy mapping specifies the dependence of the pseudo-entropy of the feature vectors on the feature vector components taken into account, that is to say the dependence of the information content, also referred to as the quantity of information, on the feature vector components taken into account.

The recognition rate mapping specifies the dependence of the speech recognition rate of the feature vectors on the feature vector components taken into account.

The recognition rate information is formed from the pseudo-entropy mapping and the recognition rate mapping by determining the dependence of the speech recognition rate on the pseudo-entropy, with the respective feature vector components taken into account linking the two mappings. It should be noted that the recognition rate information is then independent of the number of feature vector components taken into account.
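Forming the recognition rate information amounts to joining the two mappings on the number of feature vector components and then discarding that number. A sketch, assuming both mappings are held as dictionaries keyed by the number of components; the function name and the sample values are hypothetical and are not taken from figure 5.

```python
from typing import Dict, List, Tuple


def recognition_rate_information(pseudo_entropy_mapping: Dict[int, float],
                                 rate_mapping: Dict[int, float]) -> List[Tuple[float, float]]:
    """Join both mappings on the number of components, then drop that number.

    Returns (pseudo_entropy, recognition_rate) data tuples sorted by pseudo-entropy.
    """
    shared = pseudo_entropy_mapping.keys() & rate_mapping.keys()
    return sorted((pseudo_entropy_mapping[d], rate_mapping[d]) for d in shared)


# Hypothetical values for three tested dimensions.
pe = {78: 10.0, 77: 9.8, 76: 9.5}
rates = {78: 0.96, 77: 0.955, 76: 0.94}
print(recognition_rate_information(pe, rates))
# [(9.5, 0.94), (9.8, 0.955), (10.0, 0.96)]
```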

The recognition rate information is stored in the fifth memory area 205. 


The result of this method is thus the recognition rate information 500, which is represented as a functional diagram in figure 5 and which specifies, in the form of data tuples 503, the recognition rate 502 reached as a function of the determined pseudo-entropy 501 plotted on a first axis.


The recognition rate information 500 thus represents the relationship between the 
pseudo-entropy and the recognition rate which can be achieved by means of the speech 
recognition system. 

In this context it is pointed out that the recognition rate information 500 has to be determined only once for each speech recognition system, that is to say for each trained set of Hidden Markov Models.

Figure 6 shows, in a flowchart 600, the individual method steps of the speech recognition method according to the exemplary embodiment of the invention.

After the method starts (step 601), the speech recognition application within which the speech recognition is to be carried out is selected or determined (step 602).


According to this exemplary embodiment, the following speech recognition 
applications are provided as possible applications for speech recognition: 
• a speech dialogue system:

a speech recognition rate of 92 to 93% is to be ensured for a speech dialogue system;

• a vehicle navigation system:

a speech recognition rate of approximately 95% is to be ensured for this speech recognition application;

• control of a technical system, according to the exemplary embodiment a video recorder:

a speech recognition rate of approximately 95% is to be ensured for this speech recognition application;

• a telephone application:

a speech recognition rate of 95% is to be ensured for this application;

• dictation, in other words the recognition of speech information and conversion of the recognized speech signal into text in a text processing system:

for this application the maximum speech recognition rate that can be achieved with the speech recognition system is necessary, that is to say in this case it is not appropriate to reduce the number of feature vector components.
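The application requirements listed above can be kept as a simple configuration table. In this sketch the keys are hypothetical identifiers, `None` marks dictation (no reduction of feature vector components), and for the speech dialogue system the upper bound of the 92 to 93% range is assumed.

```python
from typing import Dict, Optional

# Required recognition rates per application (hypothetical keys).
# None marks dictation: the maximum achievable rate is needed, so nothing is omitted.
# Assumption: the upper bound of the 92 to 93% range is used for the dialogue system.
REQUIRED_RATE: Dict[str, Optional[float]] = {
    "speech_dialogue": 0.93,
    "vehicle_navigation": 0.95,
    "video_recorder_control": 0.95,
    "telephone": 0.95,
    "dictation": None,
}


def allows_reduction(application: str) -> bool:
    """True if feature vector components may be omitted for this application."""
    return REQUIRED_RATE[application] is not None


print(allows_reduction("dictation"))  # False
```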

For the respective speech recognition application, segmentation of the super feature vectors (step 603) is carried out in the same way as described above, with a training data record which is also stored in the second memory area 202 and which is preferably dependent on the speech recognition application.

Then, an LDA calculation is carried out (step 604), likewise in the same way as described above, by means of which an LDA matrix 605 that is dependent on the speech recognition application is determined.

Using the LDA matrix 605 that is dependent on the speech recognition application, a pseudo-entropy mapping is determined which is likewise dependent on the speech recognition application and which represents the relationship between the achievable pseudo-entropy and the number of feature vector components taken into account in each feature vector.

The respective pseudo-entropy mapping that is dependent on the speech recognition application is stored in the sixth memory area 206.
In an additional step (step 606), the necessary pseudo-entropy is determined using the previously determined required speech recognition rate for the selected application and the recognition rate information stored in the fifth memory area 205.
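One way to read step 606 off the recognition rate information is to take the smallest pseudo-entropy among the stored data tuples whose recognition rate still reaches the required rate. This selection rule is an assumption (an implementation could equally interpolate between data tuples), and the function name and sample values are hypothetical.

```python
from typing import List, Tuple


def necessary_pseudo_entropy(info: List[Tuple[float, float]], required_rate: float) -> float:
    """Smallest pseudo-entropy whose recognition rate still meets required_rate.

    info: (pseudo_entropy, recognition_rate) data tuples of the recognition rate information.
    """
    candidates = [entropy for entropy, rate in info if rate >= required_rate]
    if not candidates:
        raise ValueError("required rate is not reachable with this recognizer")
    return min(candidates)


info = [(9.5, 0.94), (9.8, 0.955), (10.0, 0.96)]
print(necessary_pseudo_entropy(info, 0.95))  # 9.8
```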

In a subsequent step (step 607), the previously determined pseudo-entropy mapping that is dependent on the speech recognition application is used to determine how many and which feature vector components (according to this exemplary embodiment, those with the smallest information content) can be omitted, in other words need not be taken into account, in the speech recognition.
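Step 607 can be sketched in the same spirit: given the application-dependent pseudo-entropy mapping and the necessary pseudo-entropy, keep the smallest number of leading feature vector components that still reaches it. The sketch assumes the components are ordered by decreasing information content, so that omitting components means dropping them from the end; names and values are hypothetical.

```python
from typing import Dict


def components_to_keep(pseudo_entropy_mapping: Dict[int, float],
                       necessary_entropy: float) -> int:
    """Smallest number of leading components whose pseudo-entropy reaches the target."""
    sufficient = [d for d, entropy in pseudo_entropy_mapping.items()
                  if entropy >= necessary_entropy]
    if not sufficient:
        raise ValueError("target pseudo-entropy exceeds that of the full feature vector")
    return min(sufficient)


mapping = {78: 10.0, 77: 9.8, 76: 9.5}
kept = components_to_keep(mapping, 9.8)
print(kept, 78 - kept)  # 77 1  (components kept, components omitted)
```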

Once the number of required feature vector components for the selected application has been determined in step 607, clustering is carried out in a following step (step 608) for the respective application and for the determined number of feature vector components. The result of the clustering is a code book 609 that is dependent on the speech recognition application, in other words a set of trained Hidden Markov Models that are dependent on the speech recognition application and are also stored in the memory. The cluster method is the same as the cluster method described above (step 405) for determining the recognition rate information 500.


Then, the speaker-independent speech recognition is carried out using the stored code book 609 that is dependent on the speech recognition application (step 610).

In other words, an utterance subsequently spoken by a user is recognized using the Hidden Markov Models according to the method for speaker-independent speech recognition described in [1], using the Viterbi algorithm (step 610).
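The Viterbi algorithm mentioned for step 610 searches for the most likely state sequence of a Hidden Markov Model. The following is a generic log-domain sketch, not the method of [1]: it assumes the emission log-likelihoods of the reduced feature vectors have already been computed for each state, and it returns the best score and state path. The toy two-state model in the usage example is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def viterbi(log_init: Sequence[float],
            log_trans: Sequence[Sequence[float]],
            log_emit: Sequence[Sequence[float]]) -> Tuple[float, List[int]]:
    """Return (best log score, most likely state path) for one utterance.

    log_init[s]     log probability of starting in state s
    log_trans[r][s] log probability of a transition from state r to state s
    log_emit[t][s]  log likelihood of the t-th reduced feature vector in state s
    """
    if not log_emit:
        raise ValueError("the utterance must contain at least one feature vector")
    states = range(len(log_init))
    score = [log_init[s] + log_emit[0][s] for s in states]
    backpointers: List[List[int]] = []
    for frame in log_emit[1:]:
        new_score: List[float] = []
        pointers: List[int] = []
        for s in states:
            # Best predecessor of state s for this frame.
            prev = max(states, key=lambda r: score[r] + log_trans[r][s])
            new_score.append(score[prev] + log_trans[prev][s] + frame[s])
            pointers.append(prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


if __name__ == "__main__":
    NEG_INF = -math.inf
    # Hypothetical left-to-right model with two states and three feature vectors.
    best, path = viterbi(
        log_init=[0.0, NEG_INF],
        log_trans=[[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]],
        log_emit=[[math.log(0.9), math.log(0.1)],
                  [math.log(0.2), math.log(0.8)],
                  [math.log(0.1), math.log(0.9)]],
    )
    print(path)  # [0, 1, 1]
```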

As described previously, the speech recognition uses the reduced feature vectors, that is to say the feature vectors without the omitted feature vector components.

In other words, if a feature vector has k feature vector components and n of them (n < k) are not taken into account, only (k - n) feature vector components have to be processed in the speech recognition.
The comparison thus also takes place in a comparison space of dimension (k - n).
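The reduction from k to (k - n) components and the resulting comparison in the reduced space can be illustrated as follows. The sketch assumes the components are ordered by decreasing information content, so that the n omitted components are the last ones, and uses a Euclidean distance purely as an example of a comparison measure; the document does not state which distance the recognizer uses. Function names and vectors are hypothetical.

```python
import math
from typing import List, Sequence


def reduce_vector(feature_vector: Sequence[float], omitted: int) -> List[float]:
    """Keep the first k - n components; the n omitted ones are the last (least informative)."""
    k = len(feature_vector)
    if not 0 <= omitted < k:
        raise ValueError("require 0 <= n < k")
    return list(feature_vector[:k - omitted])


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Example comparison measure in the (k - n)-dimensional comparison space."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical vectors with k = 4 components, n = 2 of which are omitted.
x = reduce_vector([3.0, 4.0, 0.5, 0.1], 2)
y = reduce_vector([0.0, 0.0, 0.2, 0.9], 2)
print(len(x), euclidean_distance(x, y))  # 2 5.0
```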

In addition, according to the invention, the recognition rate information is determined only once. For each new speech recognition application, all that is necessary is to determine, using the recognition rate information 500, how many and preferably which feature vector components are needed for the new speech recognition application, and to determine the code book for that number of feature vector components.

Figure 5 shows an example in which a speech recognition rate of 95% is required for the selected application; this requirement is represented in figure 5 by an intersection line 504.

Data points located above the intersection line represent a pseudo-entropy which is 
greater than would actually be necessary for the requirement of the selected application, in 
other words in order to ensure a recognition rate of 95%. 

According to this exemplary embodiment, two feature vector components can be omitted, so that the dimension of the processed feature vectors is reduced by 2.

Clearly, the invention can be seen as based on the insight that, under certain conditions, a lower recognition rate of the speech recognizer is acceptable for a specific selected speech recognition application, for example from the Command and Control area, and the invention translates this insight into a reduction of the dimension of the processed feature vectors.

After speech recognition has taken place in step 610, the method is terminated (step 611).